Table Metadata: Headers, Augmentations and Aggregates
Authors
Abstract
A sample of 200 web tables was interactively converted into layout-independent Augmented Wang Notation (AWN) using the Table Abstraction Tool (TAT). The resulting XML ground-truth files list for each table (1) cell contents, (2) relationships between the hierarchical column and row headers and the value/content/data cells, (3) designators for aggregates like totals and averages, and (4) ancillary information (augmentations) represented by table titles and captions, footnotes, and unit indicators. On average, these tables have 585 cells, 8.8 footnotes, and 1.4 rows of aggregates. They differ widely in number of cells, Wang dimensionality, and MHTML and AWN/XML file sizes. Even though TAT automates much of the repetitive work, interactive ground-truthing took on average four minutes per table. The collected ground truth is offered to the research community for experimentation on automated table processing and for realistic pseudo-random generation of table data.
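The four kinds of information listed in the abstract can be pictured with a small in-memory structure. The sketch below is illustrative only: the class `AbstractTable`, its field names, and the sample population figures are hypothetical and are not the AWN/XML schema produced by TAT, which the abstract does not spell out. It assumes that each data cell is addressed by one header path per category, which is what makes the representation independent of layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

HeaderPath = Tuple[str, ...]  # e.g. ("Europe", "France") for a two-level header


@dataclass
class AbstractTable:
    """Hypothetical stand-in for one ground-truth table (not the real AWN/XML schema)."""
    title: str
    # Each category (one Wang dimension) maps to its hierarchical header paths.
    categories: Dict[str, List[HeaderPath]]
    # Data cells keyed by one header path per category, independent of layout.
    values: Dict[Tuple[HeaderPath, ...], str] = field(default_factory=dict)
    # Augmentations: footnotes and a unit indicator.
    footnotes: List[str] = field(default_factory=list)
    units: str = ""
    # Header paths that designate aggregates such as totals or averages.
    aggregates: Set[HeaderPath] = field(default_factory=set)

    def dimensionality(self) -> int:
        """Number of categories, i.e. the Wang dimensionality."""
        return len(self.categories)

    def lookup(self, *paths: HeaderPath) -> str:
        """Return the data cell addressed by one header path per category."""
        return self.values[tuple(paths)]

    def is_aggregate(self, *paths: HeaderPath) -> bool:
        """True if any addressing header path is marked as an aggregate."""
        return any(p in self.aggregates for p in paths)


# Made-up example data, for illustration only.
table = AbstractTable(
    title="Population by region and year",
    categories={
        "Region": [("North",), ("South",), ("Total",)],
        "Year": [("2000",), ("2010",)],
    },
    units="thousands",
    footnotes=["Figures are rounded."],
    aggregates={("Total",)},
)
table.values[(("North",), ("2000",))] = "120"
table.values[(("South",), ("2000",))] = "80"
table.values[(("Total",), ("2000",))] = "200"
```

Because cells are keyed by header paths rather than by row and column positions, the same structure describes the table whether regions run down the rows or across the columns.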
Similar resources
Interactive Conversion of Web Tables
Two hundred web tables from ten sites were imported into Excel. The tables were edited as needed, then converted into layout-independent Wang Notation using the Table Abstraction Tool (TAT). The output generated by TAT consists of XML files to be used for constructing narrow-domain ontologies. On average, each table required 104 seconds for editing. Augmentations like aggregates, footnotes, t...
A Flexible Table Parsing Approach
Relational data is often encoded in tables. Tables are easy to read by humans, but difficult to interpret automatically. In cases where table layout cues are not obtainable (missing HTML tags) or where columns are distorted (by copying from a spreadsheet to text) previous table extraction approaches run into problems. This paper introduces a novel table parsing approach. Our approach is based o...
Abstractive Tabular Dataset Summarization via Knowledge Base Semantic Embeddings
This paper describes an abstractive summarization method for tabular data which employs a knowledge base semantic embedding to generate the summary. Assuming the dataset contains descriptive text in headers, columns and/or some augmenting metadata, the system employs the embedding to recommend a subject/type for each text segment. Recommendations are aggregated into a small collection of super t...
Table Header Detection and Classification
In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is an old field of research. However, due to the diversity of table...
Annotating Table Headers Based on Semantic Web Resources
Tables offer an often-used way to represent information for the human reader. But as long as those tables are not annotated with semantic information, they are meaningless to machines. In this work, a methodology is proposed to annotate the headers of table columns with semantic types by creating a ranking of possible column headers based on the column cells. In the performed experiments on 10 i...